Abstract. We give a randomized algorithm that sorts on an N node network
with constant valence in O(log N) time. More particularly the algorithm
sorts N items on an N node cube-connected cycles graph and for some
constant k for all large enough $\alpha$ it terminates within $k\alpha \log N$ time
1. Introduction

This paper is concerned with the problem of sorting N items in parallel
on a fixed-connection graph G having N nodes labeled $\{0,1,\ldots,N-1\}$ and
constant valence. Each node initially contains one key. The set X of all N
keys is assumed to have a total ordering <. The network sorts by routing each
key $x \in X$ to node $j = \mathrm{rank}(x)$, where $\mathrm{rank}(x)$ is defined as
$|\{x' \in X \mid x' < x\}|$.
This can be viewed as a distributed packet routing problem. Each $x \in X$ is
considered to be an atomic packet that has to be routed from its initial node
to the node corresponding to its rank. Both the rank computation and the packet
routing have to be realized in a completely distributed manner.
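
As a purely sequential illustration of the rank definition above (and not of the distributed algorithm itself), the following Python sketch computes rank(x) for every key and places each key at the node given by its rank. The function names are hypothetical, and the keys are assumed to be distinct.

```python
def ranks(keys):
    """Return rank(x) = |{x' in keys : x' < x}| for each key, in input order.

    Assumes the keys are distinct, so the ranks form a permutation of 0..N-1.
    """
    order = sorted(range(len(keys)), key=lambda j: keys[j])
    rank = [0] * len(keys)
    for position, j in enumerate(order):
        rank[j] = position
    return rank


def route_by_rank(keys):
    """Place the key initially held at node j into node rank(key)."""
    final = [None] * len(keys)
    for j, r in enumerate(ranks(keys)):
        final[r] = keys[j]
    return final
```

Routing every packet to the node named by its rank leaves the keys in sorted order across nodes 0, ..., N-1, which is the sense in which sorting reduces to ranking plus routing.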

We  assume  that  each  node  contains  a  single  sequential  processor  with  local 
storage  for  O(log  N)  packets.  The  processors  are  regarded  as  synchronous  for 
the purpose of step counting, but the algorithm itself does not require it. In
a unit time interval a processor may transmit one of its packets along a departing
edge and perform some elementary operation such as a comparison. The processors
are  capable  of  generating  random  bits  of  information  and  hence  running 
randomized  algorithms  in  the  sense  of  Rabin  [9]  and  Solovay  and  Strassen  [11]. 

Clearly  the  routing  required  to  sort  may  require  time  at  least  the  diameter 
of the graph. If G has constant valence then the diameter is $\Omega(\log N)$.
Hence  the  O(log  N)  time  bound  for  our  algorithm  is  asymptotically  optimal.  In 
this  paper  we  restrict  ourselves  to  demonstrating  that  this  bound  is  achievable 
in  principle  and  do  not  pursue  the  issue  of  the  magnitude  of  the  constant 
multipliers.  We  note,  however,  that  it  is  within  a  large  class  of  algorithms 
that  is  experimentally  testable  in  the  sense  of  [13]. 

The  main  components  of  the  algorithm  are  the  splitter  directed  routing 
procedure  SDR  and  the  splitter  finding  procedure  SF  which  itself  uses  SDR. 

They  are  described  and  analyzed  in  Sections  5  and  7  respectively. 
A summary of the algorithm for sorting on the n-dimensional cube connected
cycles network (CCC) of Preparata and Vuillemin [8] is as follows. Note that
the number of nodes is $N = n2^n$ and hence $n < \log N$. (Logarithms are assumed to
have base 2 throughout this paper.)

Step A: Call SF($\lambda$), where $\lambda$ denotes the empty string. This finds a set of
$2^n/n^\delta$ elements called "splitters" that divide X, when regarded as an
ordered set, into roughly equal intervals.

Step B: Route each packet to a random node and call SDR(X) with the splitters
found in Step A. This will route the keys belonging to each interval
to the $\delta \log n$ dimensional subcube corresponding to it. In this way
an approximate sort is achieved, but the keys are not spread completely
uniformly around the network.

Step  C:  Compute  the  rank  of  each  key. 

Step  D:  Route  each  packet  to  the  node  corresponding  to  its  rank. 

The O(log N) behavior of each of the four steps A-D will be established
respectively as follows: Theorem A (Section 7), Theorem B (Section 5),
Algorithm C (Section 6) and Theorem D (Section 3). We note that Theorem B is
invoked in Step B with $n - \ell = \delta \log n$, which is sufficient for the O(log N)
bound. The following then follows immediately.

Main Theorem. There is a randomized algorithm that for some k and all n and all
sufficiently large $\alpha$ sorts on an n-dimensional CCC network, and terminates
within $k\alpha n$ steps with probability greater than $1 - 2^{-\alpha n}$.

Previous algorithms for sorting N keys on constant valence fixed connection
networks of N processors require time $\Omega(\log^2 N)$. The bitonic sorter of
Batcher [3] achieves this bound on such networks as the CCC [8].

For  less  realistic  models  of  parallel  computation  faster  algorithms  have 
been  known.  For  example,  J.  Wiedermann  observed  several  years  ago  that  the 
quicksort of Hoare [6] takes time O(log N) with high likelihood on a parallel
decision tree model. Reischuk [10] has a related result for a parallel random
access  model. 

Our  current  algorithm  follows  the  randomized  routing  ideas  introduced  in 
[13] .  It  can  be  viewed  as  a  partially  successful  attempt  at  reducing  the 
sorting  problem  to  the  apparently  simpler  problem  of  routing.  In  the  analysis 
the  critical  path  technique  developed  by  Aleliunas  [1]  and  Upfal  [12]  for 
analyzing  routing  in  constant  valence  graphs  plays  an  important  part. 


2. Network Definitions


We define various constant valence networks derived from the n-dimensional
binary hypercube. Consider some fixed $n \ge 1$. Let the node set be

$V = \{(w,i) \mid w \in \{0,1\}^n,\ i \in \{0,\ldots,n-1\}\}$

which has cardinality $N = n2^n$. For each $a \in V$ let address(a) = w and
stage(a) = i if a = (w,i). Let w[i] be the i-th bit of w. Let w' = EXT(w,i)
be identical to w except that $w'[i] \ne w[i]$. Also let $\bar{w}$ be the integer of
which w is the binary representation.

We call an edge from node a to node b internal if address(a) = address(b)
and external if address(b) = EXT(address(a), stage(a)+1 mod n). It is forward if
stage(b) = stage(a)+1 mod n, static if stage(b) = stage(a), and reverse if
stage(b) = (stage(a)-1) mod n. The $CCC_n$ network of Preparata and Vuillemin [8]
has node set V and exactly all forward internal edges, reverse internal edges
and static external edges. For ease of description this paper will assume a
network more similar to that of Upfal [13] which we call $CCC^+_n$. It contains
node set V and all forward and reverse internal edges and all forward and
reverse external edges. Clearly any algorithm for $CCC^+_n$ can be simulated on
$CCC_n$ with at most a factor of two time increase. Finally, we define $CCC^*_n$
to be the network obtained by taking a $CCC^+_n$ and removing all edges that join
pairs of nodes with respective stages 0 and n-1.
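
To make the edge definitions concrete, the following Python sketch builds the node set V and the directed edges of the basic network (forward internal, reverse internal and static external). It is only an illustration under stated assumptions: addresses are represented as tuples of bits indexed from 0, and the helper names are hypothetical.

```python
from itertools import product


def ccc_nodes(n):
    """All nodes (w, i) with w in {0,1}^n and stage i in {0, ..., n-1}."""
    return [(w, i) for w in product((0, 1), repeat=n) for i in range(n)]


def ext(w, i):
    """EXT(w, i): the address identical to w except in bit i."""
    return w[:i] + (1 - w[i],) + w[i + 1:]


def ccc_edges(n):
    """Directed edges: forward internal, reverse internal, static external."""
    edges = set()
    for (w, i) in ccc_nodes(n):
        edges.add(((w, i), (w, (i + 1) % n)))            # forward internal
        edges.add(((w, i), (w, (i - 1) % n)))            # reverse internal
        edges.add(((w, i), (ext(w, (i + 1) % n), i)))    # static external
    return edges
```

For n >= 3 every node has exactly three departing edges, which is the constant valence referred to throughout; for n = 2 the forward and reverse internal edges coincide.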

The significance of $CCC^*_m$ is that numerous copies of it can be found in $CCC^+_n$
if n > m. In particular, for any $w_1, w_2$ such that $|w_1| + |w_2| = n-m$ the
subgraph of $CCC^+_n$ spanned by the nodes
$\{(w_1 w w_2, i) \mid w \in \{0,1\}^m \text{ and } |w_1| \le i < |w_1| + m\}$
is isomorphic to $CCC^*_m$.

Note that $CCC_n$, $CCC^+_n$ and $CCC^*_n$ are all naturally related to the
n-dimensional hypercube $H_n$. Intuitively, for each $w \in \{0,1\}^n$ the set of nodes
$\{a \in V \mid \mathrm{address}(a) = w\}$ can be considered to be a "supernode" of $H_n$. Each such
supernode of $H_n$ is connected by external edges to n other supernodes
$\{b \in V \mid \mathrm{address}(b) = \mathrm{EXT}(w,i)\}$ for i = 0,1,...,n-1.

For any m let $\{0,1\}^{<m>}$ be the set of binary strings of length not more
than m-1. We define a subdivision of the node set V that indexes the subsets
by binary strings from $\{0,1\}^{<n+1>}$. For each $w \in \{0,1\}^n$ let
$V[w] = \{b \in V \mid \mathrm{address}(b) = w \text{ and } \mathrm{stage}(b) = 0\}$. For each $w \in \{0,1\}^{<n>}$ let
$V[w] = \{b \in V \mid w \text{ is a prefix of } \mathrm{address}(b) \text{ and } |w| = \mathrm{stage}(b)\}$. Thus $V[\lambda]$ is
the set of nodes of stage zero, where $\lambda$ is the empty string. Let the root v[w] of
V[w] be the node with address $w0^{n-|w|}$ and stage |w|. Note that for
|w| < n-1, v[w] has a departing forward internal edge entering v[w0] and a
departing forward external edge entering v[w1].


3. Packet Routing on the CCC*


This  section  briefly  describes  the  probabilistic  packet  routing  algorithm 
of  Valiant  and  Brebner  [15]  as  applied  to  the  CCC*  by  Upfal  [13] . 

We require that each node $a \in V$ contain for each departing edge e a queue
$Q_e$ for the packets that are to be transmitted across edge e. Each node also
contains its address and stage posted as local variables.

Let X be the set of cN packets to be routed, where each packet $x \in X$
is initially at a given node $I_x \in V$ and we wish x to be routed to a given
destination node $D_x \in V$. The algorithm has two phases:

A. (Random Routing) Route x from $I_x$ to a node $R_x \in V$ with random
address.

B. (Fixed Destination Routing) Route x from $R_x$ to $D_x$.

Phase A repeats for n
stages the transmission of x across a randomly chosen departing forward edge
(i.e., transmit x across the forward internal edge or forward external edge
with equal probability). Phase B repeats for n stages the following: if x
is currently at node $a \ne D_x$ with j = stage(a)+1 and address(a)[j] =
address($D_x$)[j], then x is transmitted across the forward internal edge
departing a and otherwise x is transmitted across the forward external edge
departing a. This takes the packets to nodes with the correct addresses.
Finally, the packets are pipelined to the nodes with correct stage by traversing
internal edges.
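
The path followed by a single packet under the two phases can be sketched as follows. This is a hypothetical single-packet trace that ignores queueing and priorities entirely; addresses are tuples of bits indexed from 0, the stage arithmetic is taken mod n, and Phase B is simply run for all n stages even if the destination address is reached earlier. The names `forward` and `route` are illustrative only.

```python
import random


def forward(node, n, external):
    """Cross the forward internal or forward external edge departing `node`."""
    w, i = node
    j = (i + 1) % n
    if external:
        w = w[:j] + (1 - w[j],) + w[j + 1:]   # flip bit j of the address
    return (w, j)


def route(src, dst, n, rng):
    """Return the list of nodes visited by one packet from src to dst."""
    node, path = src, [src]
    for _ in range(n):                         # Phase A: n random forward steps
        node = forward(node, n, rng.random() < 0.5)
        path.append(node)
    for _ in range(n):                         # Phase B: correct one bit per stage
        j = (node[1] + 1) % n
        node = forward(node, n, node[0][j] != dst[0][j])
        path.append(node)
    while node[1] != dst[1]:                   # pipeline along internal edges
        node = forward(node, n, False)
        path.append(node)
    return path
```

Each packet traverses at most 3n - 1 edges in this sketch, so any delay beyond O(n) can only come from waiting in queues, which is what the priority scheme discussed next is designed to control.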

We have not yet specified the management of the queues of packets at each
node. Suppose the priority of packet $x \in X$ is assigned to be the number of
stages of phases A and B so far accomplished, and we allow packet x to be
transmitted from each node $a \in V$ only after all packets of lower priority have
been transmitted from a. Let $T_A$, $T_B$ be the total execution times of phases A
and B respectively. The techniques of Aleliunas [1] and Upfal [13] show the
following:

Theorem D. For some c > 1, for all sufficiently large $\alpha$,
$\mathrm{Prob}(T_A > c\alpha n) < N^{-\alpha}$ and $\mathrm{Prob}(T_B > c\alpha n) < N^{-\alpha}$.

We note that since the first phase sends packets to random addresses the
probability that, at its completion, there are more than $c_1\alpha n$ packets at any
one node, or $c_2\alpha n$ packets at any address, can be similarly bounded by $N^{-\alpha}$
(for suitable constants $c_1$ and $c_2$).
4. Some Combinatorial Identities


We shall use the following inequalities. Let exp denote exponentiation of
Euler's constant e.

Fact 4.1. For all x > 0, $(1 + x^{-1})^x < e$. For all sufficiently large x > 0, $(1 + x^{-1})^x > e(1 - 1/(2x))$.

Let B(m,N,p) be the probability that in N independent Bernoulli trials
with probability p of success there are at least m successes.

Fact 4.2. (Chernoff [4])

$B(m,N,p) \le (Np/m)^m \exp(m - Np)$ if $m \ge Np$.

Fact 4.3. ([2]) If $m = Np(1+\beta)$ where $0 \le \beta \le 1$ then

$B(m,N,p) \le \exp(-\beta^2 Np/2)$.

Fact 4.4. (Hoeffding [7]) If we have N independent Poisson trials with
respective probabilities $p_1,\ldots,p_N$ where $\sum_i p_i = Np$ and if $m \ge Np + 1$ is an
integer then the probability of at least m successes is at most B(m,N,p).
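
As a numerical sanity check, the following Python sketch compares the exact binomial tail B(m,N,p) with the Chernoff-type bound $(Np/m)^m \exp(m - Np)$ for $m \ge Np$, which is the form in which Fact 4.2 is read here. The function names are hypothetical and the parameter values in any experiment are arbitrary.

```python
from math import comb, exp


def binomial_tail(m, N, p):
    """Exact B(m, N, p): probability of at least m successes in N trials."""
    return sum(comb(N, k) * p ** k * (1 - p) ** (N - k) for k in range(m, N + 1))


def chernoff_bound(m, N, p):
    """The bound (Np/m)^m * exp(m - Np), intended for m >= Np."""
    return (N * p / m) ** m * exp(m - N * p)
```

The bound is trivial (equal to 1) at m = Np and decays faster than exponentially in m once m exceeds e times the mean, which is the regime used repeatedly in the proofs below.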


Fact 4.5. ([5], p. 18) If $n = o(N^{2/3})$ then

$\binom{N}{n} = (1 + o(1)) \frac{N^n}{n!} \exp(-n^2/2N)$.

Fact 4.6. Suppose x < a, X < A are all functions of n such that $Xx = o(A)$
and $X = o(A^{2/3})$. Let $x = aP \pm G$, $X = AP \mp G$ where $P = (X + x)/(A + a)$, $G = o(aP)$
and $G = o(AP)$. Then

$\binom{a}{x}\binom{A}{X} \Big/ \binom{A+a}{X+x} \le (1 + o(1)) \exp(-G^2/(2aP))$.

Proof. Applying Fact 4.5 gives

$\binom{A}{X} = (1 + o(1)) \frac{A^X}{X!} \exp(-X^2/2A)$

and

$\binom{A+a}{X+x} \ge (1 + o(1)) \frac{(A+a)^{X+x}}{(X+x)!} \exp(-(X^2 + 2xX + x^2)/2A)$.

Using Xx = o(A) and applying Stirling's formula to X!, (X+x)! and x! gives

$\binom{a}{x}\binom{A}{X} \Big/ \binom{A+a}{X+x} \le \left(\frac{a}{x}\right)^x \left(\frac{A}{X}\right)^X \left(\frac{X+x}{A+a}\right)^{X+x} (1 + o(1))$.

Substituting x = aP + G and X = AP - G (or x = aP - G and X = AP + G) and
using Fact 4.1 gives the claimed bound. □

We shall denote by $\omega(1)$ any function that tends to infinity as $n \to \infty$.

We shall assume that ratios take integral values whenever this is convenient
and otherwise insubstantial.

5. Splitter Directed Routing

Let X be a set of cN keys that are totally ordered by the relation <.
We assume that each key $x \in X$ is initially located at a random node in $V[\lambda]$
chosen independently of any other key in $X - \{x\}$. Suppose that we are given a
set of splitters $\Sigma \subset X$ of size $|\Sigma| = 2^\ell - 1$. We index each splitter $\sigma[w] \in \Sigma$
by a distinct binary string $w \in \{0,1\}^{<\ell>}$ of length less than $\ell$. Let $<^*$
denote the ordering defined as follows: for all $w,u,v \in \{0,1\}^{<\ell>}$,
$w0u <^* w <^* w1v$. We require that for all $w_1, w_2 \in \{0,1\}^{<\ell>}$, $\sigma[w_1] < \sigma[w_2]$ iff
$w_1 <^* w_2$. We assume that a copy of each splitter $\sigma[w]$ is already available
in each node of V[w].

Let $X[\lambda] = X$ where $\lambda$ is the empty string. Initially we assume that the
keys of $X[\lambda]$ are located at $V[\lambda]$, that is the nodes of V having stage zero.
The splitter directed routing is executed in $\ell$ temporally overlapping stages
$i = 0,1,\ldots,\ell-1$. For each $w \in \{0,1\}^i$ the set of keys X[w] are all eventually
routed through V[w]. The splitter $\sigma[w]$ partitions $X[w] - \sigma[w]$ into disjoint
subsets

$X[w0] = \{x \in X[w] \mid x < \sigma[w]\}$

and

$X[w1] = \{x \in X[w] \mid \sigma[w] < x\}$

which are subsequently routed through V[w0] and V[w1] respectively.

Suppose that a key $x \in X[w]$ is located at a node $a \in V[w]$ with address
ww' and stage i. Let B be the first bit of the address suffix w'. Then
x is transmitted from node a across the departing forward internal edge if
$B = (\sigma[w] < x)$, and x is transmitted across the departing forward external
edge otherwise. Thus if $x < \sigma[w]$ then x is transmitted to a node with
address prefix w0, and if $\sigma[w] < x$ then x is transmitted to a node with
prefix w1.
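
The effect of the routing rule on a single key can be sketched sequentially as follows. The sketch indexes $2^\ell - 1$ sorted splitters by binary strings so that the in-order traversal of the index tree agrees with the key order, and then follows a key down $\ell$ levels. It is an illustration only: the network, the queues and the address suffix bits are all omitted, and the function names are hypothetical.

```python
def index_splitters(sorted_splitters):
    """Map strings w with |w| < l to splitters so that
    sigma[w0u] < sigma[w] < sigma[w1v] for all u, v."""
    sigma = {}

    def build(lo, hi, w):
        if lo >= hi:
            return
        mid = (lo + hi) // 2
        sigma[w] = sorted_splitters[mid]
        build(lo, mid, w + "0")        # smaller splitters get prefix w0
        build(mid + 1, hi, w + "1")    # larger splitters get prefix w1

    build(0, len(sorted_splitters), "")
    return sigma


def sdr_prefix(x, sigma, l):
    """Address prefix of length l reached by a non-splitter key x."""
    w = ""
    for _ in range(l):
        w += "1" if sigma[w] < x else "0"
    return w
```

Reading the resulting prefixes as binary numbers gives the index of the interval containing each key, so keys with smaller prefixes are smaller than keys with larger prefixes.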
Note that at any one time distinct keys may be at distinct stages. When
all the keys have completed stage $\ell-1$ the keys $X - \Sigma$ are partitioned into $2^\ell$
disjoint subsets of the form X[w] where $w \in \{0,1\}^\ell$, and the keys X[w] are
then located within V[w]. The following follows directly from the assumption
that $\sigma[w_1] < \sigma[w_2]$ if $w_1 <^* w_2$:

Lemma 5.1. For any $w_1, w_2 \in \{0,1\}^\ell$, if $w_1 <^* w_2$ then $x_1 < x_2$ for all
$x_1 \in X[w_1]$, $x_2 \in X[w_2]$.

Also,  since  each  packet  is  assumed  initially  to  be  at  a  random  node  and 
since  the  above  described  splitter  directed  routing  (SDR)  procedure  does  not 
modify  the  last  n-£  bits  in  the  address  of  a  packet,  we  can  deduce  that: 

Lemma 5.2. For each $w \in \{0,1\}^\ell$ and each $x \in X[w]$, SDR takes x to a random
node in V[w] chosen independently of any other packet.

The SDR procedure can be viewed as a generalization of Phase B of the
routing procedure described in Section 3. It routes packets from random source
nodes to specified destinations such that the number of packets destined for
each region is about the same. The analysis used in the proof is an extension
of the techniques introduced by Aleliunas [1] and Upfal [13] for establishing
good bounds for such constant degree graphs as the CCC and d-way shuffle.

Theorem B. Suppose we have a network $CCC^*_n$ with a set X of $cn2^n$ packets
and a set $\Sigma$ of $2^\ell - 1$ splitters where $n > \ell \ge n/2$ such that for
all $w \in \{0,1\}^\ell$, $|X[w]| < 2cn2^{n-\ell}$. Suppose that all the remaining
packets are at independently chosen random nodes of $V[\lambda]$. If T
is the total time for execution of SDR then for some $c_2, k > 0$,
for all sufficiently large $\alpha$ and all c > 1

$\mathrm{Prob}(T > c_2 c\alpha n) < 2^{-\alpha n} + \exp(-k \cdot 2^{n-\ell}) \cdot 2^{O(\alpha n)}$.
Proof. First we observe that since the packets are randomly distributed
initially, the probability that some $a \in V[\lambda]$ initially contains more than
$c(\alpha+1)n$ keys is less than $2^{-c(\alpha+1)n}$ if $\alpha > e^2$. This follows immediately
from Fact 4.2.

Let $\beta = \alpha + 1$. To each packet we assign a random integer from the set
$\{1,\ldots,\beta n\}$ as its priority. Each packet has probability $(\beta n)^{-1}$ independently
of being assigned any particular such number. In SDR we will insist that no
key be forwarded from a node before all keys of higher priority that ever visit
it have been forwarded. [In practice we simply forward the packets currently
at any node in order of their priority. This will be at least as fast, clearly,
as the hypothesized algorithm that prophesies about future arrivals.]

For each node a and priority $\pi \in \{1,\ldots,\beta n\}$ let task $\tau = (a,\pi)$
be the job of forwarding all keys of priority $\pi$ that ever visit node a. Let
a delay be any pair of tasks $(\tau_1,\tau_2) = ((a,\pi),(b,\rho))$ where either a = b and
$\rho = \pi + 1$ or (a,b) is an edge of the network and $\pi = \rho$. The two cases
correspond to the two ways in which the execution of a task $\tau_2$ may depend on
the completion of task $\tau_1$. In the first case $\tau_2$ has to wait for the packets of
priority $\pi$ to be processed at its node. In the second case $\tau_2$ has to
wait for the arrival of a packet from an adjacent node.

Let a delay sequence D be a sequence of delays
$(\tau_0,\tau_1), (\tau_1,\tau_2), \ldots, (\tau_{d-2},\tau_{d-1}), (\tau_{d-1},\tau_d)$. Note that $d \le \ell + \beta n$ (recall $\beta = \alpha + 1$) since in each delay in
any such sequence either the stage of the node increases by one or the priority
increases by one. Since there are just two possible forward edges of transmission
and just one way of increasing the priority, the total number of delay
sequences starting at any one node is at most $3^{\ell+\beta n}$. Hence their total number
is at most

$2^n 3^{\ell+\beta n} \le 2^{5n+2\alpha n}$.

Let T(D) be the number of time units (i.e., packet transmissions)
involved in D (i.e., in $\tau_0,\tau_1,\ldots,\tau_d$). It remains to prove that for some
$c_4$, for all D and all sufficiently large c and $\alpha$,

$\mathrm{Prob}(T(D) > c_4 c\alpha n) < 2^{-3c\alpha n - 6n}$

for then the probability that the worst sequence suffers that much delay is
at most

$2^{-3c\alpha n - 6n} \cdot 2^{2\alpha n + 5n} < 2^{-\alpha n - n}$.

This is proved under the assumption that there are at most $c(\alpha+1)n$ packets
initially at any node. Since, as has been observed, the failure of this
assumption is equally unlikely, the result follows.

To establish the time bound on T(D) consider any particular D and let
$\tau_j = (a,\pi)$ where stage(a) = i be a task in D. Let $P_j$ be the set of keys
that have nonzero probability of being routed through $\tau_j$ (i.e., if their
priority and initial position are suitably chosen) but would then depart from
D at $\tau_j$. Departure from D occurs either because $(\tau_j,\tau_{j+1}) = ((a,\pi),(a,\pi+1))$
(since the priority of a packet cannot change) or because $(\tau_j,\tau_{j+1}) =
((a,\pi),(b,\pi))$ but (a,b) is not the edge along which the packet leaves node a.
Note that in the latter case the i-th bit of the destination address of packets
that depart from D at $\tau_j$ is different from those that depart at later
points. It is easily deduced that once the priorities are fixed, the sets
$P_1, P_2, \ldots, P_j, \ldots, P_d$ are pairwise disjoint.

Now $P_j$ is just the union of X[w] for various $w \in \{0,1\}^\ell$ such that w and a
agree in the first i bits. By the assumption about the size of X[w] it
follows that $|P_j| < 2cn2^{n-i}$.

Let $R_j$ be the set of keys that have nonzero probability of being routed
through $\tau_j$ once the priorities have been decided. Since the priorities are
determined by Bernoulli trials with probability $(\beta n)^{-1}$, Fact 4.2 can be used to
give the following bound

$\mathrm{Prob}(|R_j| > 4cn2^{n-i}(\beta n)^{-1}) < \exp(-k \cdot 2^{n-\ell})$

for an appropriate constant k > 0. The second term in the theorem follows from
multiplying the above bound by the number of choices of D and j.

Finally, let $K_j$ be the actual set of keys that do depart from D at
$\tau_j$ because both the priority and the initial positions were appropriately
chosen. For each such packet the initial position must agree with a in the
last n-i bits. Hence $K_j$ is determined by $|R_j|$ Bernoulli trials each with
probability $2^{i-n}$ of success. Hence assuming $|R_j| < 4cn2^{n-i}(\beta n)^{-1}$ for each j
we have Bernoulli trials with expectation $< 4c/\beta$. To upper bound

$\sum_j |K_j|$

we appeal to Hoeffding's Theorem (Fact 4.4). We have at most $cn2^n$ trials with
mean at most $(4c/\beta)(\ell + \beta n) \le 5cn$ if $\beta \ge 4$. Using Fact 4.2 it follows that

$\mathrm{Prob}(\sum_j |K_j| > c_3 c\alpha n) < 2^{-c_3 c\alpha n}$   (1)

if $c_3\alpha > 5e$.

Finally, we have to consider the case of packets being involved in more
than one task of D. This can be done by considering any fixed assignment of
keys to departure points in D and considering the probabilities of repeated
earlier involvement in D. If a key was involved in D at $\tau_j$ then the probability
of a previous involvement at $\tau_{j-1}$ is at most one half independent of subsequent
involvements. Hence if a key was involved in D at $\tau_j$ then the probability of t
previous involvements (i.e., with $\tau_{j-1},\ldots,\tau_{j-t}$) is at most $2^{-t}$. It follows that

$\mathrm{Prob}(T(D) > K + s \text{ and } \sum_j |K_j| = K) < 2^{-s} \cdot \mathrm{Prob}(\sum_j |K_j| \ge K)$   (2)

From (1) and (2) it follows that if $c_3\alpha > 5e$

$\mathrm{Prob}(T(D) > 2c_3 c\alpha n) < 2^{1 - c_3 c\alpha n}$.
6. Deterministic Sorting and Ranking

We use as subroutines some known deterministic algorithms. A crucial
step in splitter finding is sorting a sparse subset of elements. For this
we can use the algorithm of Nassimi and Sahni [8].

Theorem NS. For any $\epsilon > 0$, $N^{1-\epsilon}$ keys can be sorted on a $CCC_n$ where $N = n2^n$
in time O(n).

Step C of the overall algorithm determines the rank of every element given
that it is "almost" sorted. Suppose that for some i we have that all elements
are in nodes at stage i and for all $w_1 <^* w_2$ with $|w_1| = |w_2| = i$ the
keys in $V[w_1]$ are smaller than the keys in $V[w_2]$. If i = n then we have
a complete sort except that the elements may not be uniformly distributed
among the stage 0 nodes. In this situation the rank of each key can be
determined by first sorting the keys at each node locally. The global rank
computation is performed on the binary tree that has these nodes as leaves and
consists of all forward internal edges, and just those forward external edges
along which some address bit changes from 0 to 1. The number of keys in each
subcube can be determined recursively by sending these sums from the leaves
toward the root and accumulating at each internal tree node. Finally in a
reverse information flow from the root to the leaves, the range of the ranks
in each subcube can be determined, and hence the ranks of the individual keys.
This all takes O(n) parallel transfers of tokens that contain only binary
numbers of O(n) digits.
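
The two information flows in this tree computation can be mimicked sequentially. In the following Python sketch, which is a hypothetical illustration rather than the distributed procedure, the leaves are given as a list of buckets in address order (a power of two many), the recursion on halves plays the role of the binary tree, and the subtree sums correspond to the counts sent toward the root.

```python
def leaf_offsets(counts, base=0):
    """Rank of the smallest key in each leaf, given per-leaf key counts.

    The sum over the left half is the count sent up from the left subtree;
    it is then passed down as the starting rank of the right subtree.
    """
    if len(counts) == 1:
        return [base]
    half = len(counts) // 2
    left_total = sum(counts[:half])
    return (leaf_offsets(counts[:half], base)
            + leaf_offsets(counts[half:], base + left_total))


def ranks_from_buckets(buckets):
    """Global ranks, assuming every key in bucket k precedes every key in bucket k+1."""
    offsets = leaf_offsets([len(b) for b in buckets])
    rank = {}
    for off, bucket in zip(offsets, buckets):
        for pos, x in enumerate(sorted(bucket)):   # local sort at each leaf
            rank[x] = off + pos
    return rank
```

Only counts travel along the tree, never keys, which is why the tokens need no more than O(n) binary digits.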

In Step C of the actual algorithm we start with only a partial sort (i.e.,
for all $w_1 <^* w_2$ with $|w_1| = |w_2| = n-s$ where $s = \delta \log_2 n$, for all $x \in V[w_1]$
and $y \in V[w_2]$, x < y). To find ranks in this situation we determine the rank
range for each $X[w_1]$, sort each $X[w_1]$, and finally deduce the rank of each
element. The determination of the rank ranges and final rank is as described
in the above paragraph. With overwhelming probability each $X[w_1]$ will have
at most $2n2^s$ packets. For sorting $X[w_1]$ we assign a separate $CCC^*_t$ to
it where t = s + log n - log s. At least if t divides n, one can find
$n2^n/(t2^t)$ disjoint copies of $CCC^*_t$ in $CCC^*_n$. The packets are routed to
their appropriate copy of $CCC^*_t$ (Theorem D) and then sorted there by some o(n)
method such as Batcher's (see Preparata and Vuillemin [9]) which takes
$O(\log^2 n)$ time. The above described algorithm for ranking the elements given a
partial sort will be called Algorithm C.


7. Splitter Finding

We describe a procedure SF that given a $CCC^*_n$ with c packets at each
node finds a subset U of $2^n/n^\delta$ packets called "splitters" that divide the
ordered sequence of the $cn2^n$ total packets into intervals that are, with
large probability, all of length smaller than $2cn^{\delta+1}$. The procedure is
recursive, nested recursive calls corresponding to nested subcubes. At the
i-th level of recursion the splitters found divide the ordered sequence into
$2^{n(1-2^{-i})}$ roughly equal intervals. The subcubes at the i-th level are $CCC^*_r$
where $r = n/2^i$ [$i = 0,\ldots,\log n - \log(2\delta \log n)$]. At the i-th level a fraction of
about $2^{-i}$ of the packets are considered "active". The choice of splitters
at lower levels is restricted to these active elements. In this way the
average density of active packets in each $CCC^*_r$ is kept a constant c
independent of the cube size. This is necessary for the recursive procedure to
succeed. Any integer greater than or equal to six suffices as a value of $\delta$.
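
The recursion schedule can be tabulated directly. The following Python sketch lists, for each level i, the subcube dimension $r = n/2^i$ and the active fraction $2^{-i}$, stopping once r would fall below $2\delta \log n$. It assumes n is a power of two and reads the stopping rule as "r at least $2\delta \log_2 n$"; the function name and the default $\delta = 6$ are illustrative choices.

```python
from math import log2


def sf_levels(n, delta=6):
    """Return (level i, subcube dimension r, active fraction) for each level.

    Assumption: recursion continues while r = n / 2^i >= 2 * delta * log2(n).
    """
    floor_dim = 2 * delta * log2(n)
    levels = []
    i, r = 0, n
    while r >= floor_dim:
        levels.append((i, r, 2.0 ** -i))
        i += 1
        r = n // 2 ** i
    return levels
```

Since the dimensions halve from level to level, their sum is less than 2n, which is the source of the O(n) total for the deterministic work along any chain of nested calls.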

The set U of all splitters found in a run of SF($\lambda$) will be used in
Step B of the overall sorting procedure.

The procedure SF applied to the subcube with root (w, n-m), where
|w| = n-m, is as follows. When the procedure is called initially with $w = \lambda$
(the empty string) all the packets are considered active.


Procedure SF(w)

(1) Let Y[w] be the active packets in V[w]. For each $x \in Y[w]$ route x
to a random node in V[w].

(2) For each $w_1$ with $|w_1| = m/2 + 2\log n$, choose at random an active element
from $V[ww_1]$. Sort this set $S^*$ of $n^2 2^{m/2}$ chosen elements using Theorem NS.
Route the j-th largest to address $w + j2^{m/2}/n^2$. Let S, the newly created
set of splitters, be the packets at addresses $w + j2^{m/2}$ for $j = 1,\ldots,2^{m/2}-1$.
If the splitter is found at address $ww_2$ and $w_2 = w_3 1 w_4$ where $w_4 \in 0^*$ then
the splitter is denoted by $\sigma[ww_3]$ and routed deterministically to every node
in $V[ww_3]$.

(3) For each $x \in Y[w] - S$ decide according to a Bernoulli trial with
probability one half whether it is to remain active. Let the active subset
of Y[w] be Z[w].

(4) Apply SDR with the newly found splitters to Z[w].

(5) For each w' with |w'| = m/2 let Y[ww'] be the subset of Z[w]
routed to subcube V[ww'] by (4). For each such w' call in parallel
SF(ww') for Y[ww'] as active elements, unless $m = 2\delta \log n$.
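
The selection in step (2) is an oversampling scheme: draw many random elements, sort them, and keep every q-th one. The following sequential Python sketch illustrates that idea only. It assumes a uniform sample without replacement from a list of distinct keys (the procedure itself picks one active element from each $V[ww_1]$), and the parameter names `groups` (playing the role of $2^{m/2}$) and `oversample` (playing the role of $n^2$) are hypothetical.

```python
import random
from bisect import bisect_left


def sample_splitters(keys, groups, oversample, rng):
    """Sort a random sample of groups*oversample keys and keep the elements at
    positions oversample, 2*oversample, ..., (groups-1)*oversample (1-based)."""
    sample = sorted(rng.sample(keys, groups * oversample))
    return [sample[j * oversample - 1] for j in range(1, groups)]


def interval_sizes(keys, splitters):
    """Number of non-splitter keys falling in each of the resulting intervals."""
    chosen = set(splitters)
    sizes = [0] * (len(splitters) + 1)
    for x in keys:
        if x not in chosen:
            sizes[bisect_left(splitters, x)] += 1
    return sizes
```

Taking every q-th sample element rather than a bare random sample of size `groups - 1` is what makes the intervals concentrate around their mean length; Lemma 7.1 below quantifies this for q = $n^2$.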

We have seen that SDR for $CCC_r$ takes time O(r) with overwhelming
probability. Theorem A will establish that if SF is run, with the recursive
calls of SF being allowed to be asynchronous, then the overall algorithm runs
in time O(n) with large probability. The main fact which has to be established
(Theorem 7.2) is that with overwhelming probability, at every call of SDR the
hypotheses of Theorem B are satisfied. We leave it to the reader to verify that
all the other operations performed in a call of SF(w) with |w| = n-m can be
achieved deterministically, by pipelining if necessary, in time O(m).

First  we  need  a  technical  lemma: 

Lemma 7.1. Given an ordered set T suppose that a set $S^*$ of $n^2 2^{m/2}$ elements
are chosen from T at random and $S^*$ is then sorted. Let $S \subset S^*$ be the
subset of elements having positions $n^2, 2n^2, \ldots, (2^{m/2}-1)n^2$ in the ordered set.
Suppose $t_0,\ldots,t_{f+1}$ is the longest ordered subsequence of T such that
$t_0, t_{f+1} \in S$ but $t_1,\ldots,t_f \notin S$. Then

(i) $\mathrm{Prob}(f > (1+n^{-1/3})|T|/2^{m/2}) = N^{-\omega(1)}$

(ii) $\mathrm{Prob}(f < (1-n^{-1/3})|T|/2^{m/2}) = N^{-\omega(1)}$.

Suppose that a subset $Y \subset T - S$ is chosen by performing independent Bernoulli
trials with probability 1/2. Let $y_0,\ldots,y_{h+1}$ be the longest ordered
subsequence of $Y \cup S$ such that $y_0, y_{h+1} \in S$ but $y_1,\ldots,y_h \notin S$. Then

(iii) $\mathrm{Prob}(h > (1+2n^{-1/3})|T|/(2 \cdot 2^{m/2})) = N^{-\omega(1)}$

(iv) $\mathrm{Prob}(h < (1-2n^{-1/3})|T|/(2 \cdot 2^{m/2})) = N^{-\omega(1)}$.

These claims assume that $n^4 2^{m/2} = o(|T|)$ and $n^2 2^{m/2} = o(|T|^{2/3})$.


Proof. All choices of $S^*$ are equally likely. To prove (i) and (ii) consider
any sequence $t_0,\ldots,t_{f+1}$ with $f = (1 \pm n^{-1/3})|T|/2^{m/2}$. Then the probability that
of the $n^2 2^{m/2}$ members of $S^*$ exactly $n^2$ lie in the above range and the
rest outside is

$\binom{|T|-f}{n^2 2^{m/2} - n^2} \binom{f}{n^2} \Big/ \binom{|T|}{n^2 2^{m/2}}$.

Applying Fact 4.6 with A = |T|, a = f, $X = n^2 2^{m/2} - n^2$, $x = n^2$ gives $G = n^{5/3}$ and
an upper bound of

$\exp(-n^{4/3}/2)$

provided $n^4 2^{m/2} = o(|T|)$ and $n^2 2^{m/2} = o(|T|^{2/3})$. This establishes (i) and (ii)
since there are at most $2^n$ choices of $t_0$, $t_{f+1}$ and f respectively.

To show (iii) and (iv) it is sufficient to prove that in a sequence of
$(1 \pm n^{-1/3})|T|/2^{m/2}$ ordered elements of T the probability that the number of
elements chosen to be in Y is outside the range $(1 \pm 2n^{-1/3})|T|/(2 \cdot 2^{m/2})$ is
negligible. In fact Fact 4.3 upper bounds this probability by

$\exp(-n^{-2/3}|T|/(4 \cdot 2^{m/2}))$

which is bounded above by $\exp(-n^{4/3})$ if $2^{m/2} n^2 = o(|T|)$. □
Theorem 7.2. In a run of SF($\lambda$) the probability of each of the following
events for each recursive call of SF(w) is bounded above by $N^{-\omega(1)}$
provided $m \ge 12 \log n$.

(a) Step (2) fails because $V[ww_1]$ has no active packets.

(b) In the call SDR for subcube w with |w| = n-m, it happens that
$|Z[w]| > 2cm2^m$ or $|Z[w]| < cm2^m/2$.

(c) In the call SDR for subcube w with |w| = n-m, it happens that
$|X[w]| > 2cn2^m$.

Since in a run of SF($\lambda$) there are at most 3N such events altogether the
probability that any such event ever occurs in a run is therefore also bounded
by $N^{-\omega(1)}$.

Proof. The proof proceeds by induction on the depth of recursion. We assume
that the Theorem holds down to the current level of recursion and argue that
the probability of "going wrong" at the current call is less than $N^{-\omega(1)}$.

(a) Since the active elements Y[w] have been sent to random nodes in
V[w] the probability that they all miss $V[ww_1]$ is

$(1 - 1/(2^{m/2} n^2))^{cm2^m/2}$.

By Fact 4.1 this is bounded above by

$\exp(-cm2^{m/2}/(2n^2)) = N^{-\omega(1)}$

if $m \ge 12 \log n$.
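
To see how small this probability is in concrete terms, the following Python sketch evaluates the natural logarithm of the exact expression $(1-q)^K$ and of the bound $\exp(-qK)$, with $q = 1/(2^{m/2} n^2)$ and $K = cm2^m/2$ taken as the assumed lower bound on the number of active packets. Working with logarithms avoids underflow; the function name is hypothetical.

```python
from math import log1p


def log_miss_probability(n, m, c=1):
    """Return (ln of (1-q)^K, ln of the bound exp(-q*K)).

    q: chance that one randomly placed packet lands in a given V[w w_1].
    K: assumed number of active packets, c*m*2^m/2.
    """
    q = 1.0 / (2 ** (m // 2) * n ** 2)
    K = c * m * 2 ** m / 2
    return K * log1p(-q), -K * q
```

For example, with n = 16 and m = 12 log n = 48 the bound has logarithm -1572864, whereas ln N is below 14, so the failure probability is far smaller than any fixed inverse power of N.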

(b) We assume inductively that in the call of SDR at the i-th level of
recursion the set T of elements in the corresponding subcube had size in the
range $(1 \pm n^{-1/3})^i cn2^m$ (where $m = n/2^i$). Applying Lemma 7.1 (i) gives that,
at the next level of recursion, the probability of a subcube having more than
$(1 + n^{-1/3})2^{-m/2}$ times as many packets is bounded above by $N^{-\omega(1)}$.

(c) We assume inductively that in the call of SDR at the i-th level of
recursion the set of active elements, denoted again by T, in the subcube
corresponding to w has size at most $(1 + 2n^{-1/3})^i cn2^{m-i}$. Then by Lemma 7.1 (iii),
except with probability $N^{-\omega(1)}$, the number of active elements in a subcube at
the (i+1)-st level call is at most $(1 + 2n^{-1/3}) 2^{-m/2} \cdot 2^{-1}$ times this quantity,
which is $(1 + 2n^{-1/3})^{i+1} cn2^{m/2-(i+1)}$ as required. □


Theorem A. For all c there is a $c_5$ such that for all sufficiently large $\beta$
if SF($\lambda$) is run on $CCC^*_n$ with c packets per node then

$\mathrm{Prob}(T > c_5 \beta n) < N^{-\beta}$.

Proof. In a run of SF a critical path is a sequence of nested calls
SF($\lambda$), SF($w_1$), SF($w_1w_2$), ..., SF($w_1w_2 \ldots w_i$), ... where $|w_i| = n2^{-i}$. The
deterministic components of each take time proportional to $|w_i|$. When summed for
$i = 1,\ldots,\log n - \log(2\delta \log n)$ this gives an upper bound of O(n) as required.
Hence it remains only to analyze the cumulative probabilistic effects of such
a chain of calls of SDR. Note that these calls are probabilistically independent.

Theorem B says that for sufficiently large $\alpha$ a call of SDR on $w_1 \ldots w_i$
exceeds runtime $c_2\alpha n/2^i$ with probability less than

$2^{-\alpha n/2^i}$.

Hence it exceeds runtime $c_2n/2^i + (\alpha-1)c_2n/2^i = c_2n/2^i + t_i$ (say) with
probability less than

$2^{-t_i/c_2}$.

Hence the probability that such a sequence of nested calls takes time more than
$c_2n + t$ is less than

$\sum_{\sum t_i = t} \prod_i 2^{-t_i/c_2} \le \sum_{\sum t_i = t} 2^{-t/c_2} < 2^{-c_4\alpha n}$

for some $c_4$ if $t = c_2(\alpha-1)n$ and $\alpha > 2$. The result follows.